[00:15.400 --> 00:23.600]  Today I want to tell you about a project in the works to use data science to help verify elections.
[00:24.270 --> 00:33.090]  And although this project started in earnest only in October, we are going to have a publicly
[00:33.090 --> 00:41.010]  available version of it up and running at Verified Voting for the November election.
[00:41.620 --> 00:47.630]  But I first started thinking about this project, this project of using data to help verify
[00:47.630 --> 00:54.970]  elections back in 2005. And at the time I had what seemed to me a pretty simple idea,
[00:54.970 --> 01:03.370]  which was to take the number of ballots cast in each polling place and compare it to the number
[01:03.370 --> 01:10.350]  of voters who signed in in each polling place. And just to make sure that the one wasn't bigger
[01:10.350 --> 01:19.770]  than the other. And to use that as a check on elections and on whether an election was fair,
[01:19.770 --> 01:27.470]  whether you can verify. How do you verify that the election was run correctly?
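The check described here, ballots cast compared against voters signed in for each polling place, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout, and the ward figures are all made up for the example and do not come from the talk or from the project's code.

```python
from typing import Dict, List, Tuple


def ballot_signin_discrepancies(
    ballots_cast: Dict[str, int],
    voters_signed_in: Dict[str, int],
) -> List[Tuple[str, int]]:
    """Return (polling_place, excess) pairs where ballots cast exceed sign-ins.

    Hypothetical helper: a polling place missing from the sign-in data is
    treated as having zero sign-ins, which is an assumption of this sketch.
    """
    flagged = []
    for place, ballots in ballots_cast.items():
        signed_in = voters_signed_in.get(place, 0)
        excess = ballots - signed_in  # the "just subtraction" step
        if excess > 0:
            flagged.append((place, excess))
    # Largest discrepancy first, so the most urgent places are listed on top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up figures, for illustration only.
ballots = {"Ward 1": 412, "Ward 2": 388, "Ward 3": 501}
sign_ins = {"Ward 1": 415, "Ward 2": 380, "Ward 3": 501}

discrepancies = ballot_signin_discrepancies(ballots, sign_ins)
print(discrepancies)  # [('Ward 2', 8)]
```

The arithmetic is the easy part. As the talk goes on to say, the hard part is obtaining both inputs, in a usable format, before certification.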
[01:28.170 --> 01:32.410]  Well, I thought it was a pretty simple idea and it turned out that actually it was a very
[01:32.410 --> 01:40.850]  complicated idea. Because though the math of it is simple, the technology of it seems simple,
[01:40.850 --> 01:50.630]  it's just subtraction. You have to have the power to get it done. You have to have the power to get
[01:50.630 --> 01:58.390]  the data. You have to have the power to get the data in the right format and to get the data in
[01:58.830 --> 02:08.070]  a timeframe that it's useful. Some thanks are in order. I would like to thank the National Science
[02:08.070 --> 02:14.370]  Foundation for funding and in particular Jeremy Epstein for believing in the project.
[02:14.370 --> 02:23.770]  I would like to thank my team at Portland State, Eric Tsai and Raghu from Guadalupe,
[02:24.410 --> 02:31.610]  and also the staff and faculty at the Hatfield School at Portland State University. I'd like to
[02:31.610 --> 02:38.170]  thank Verified Voting and in particular Marian Schneider for supporting the project and having
[02:38.170 --> 02:44.210]  the vision to make it a reality even on a short notice. And finally, I'd like to thank the
[02:44.210 --> 02:50.990]  organizers of DEFCON for fulfilling a lifelong dream I never knew I had, namely to be active in
[02:50.990 --> 02:59.650]  the chat of my own presentation. So thanks to everyone. Let's start with a real story from a
[02:59.650 --> 03:07.810]  real election. In 2018, there were congressional elections all over the country, including in North
[03:07.810 --> 03:16.990]  Carolina. And in the ninth district, after election day, there were some questions. And
[03:16.990 --> 03:24.810]  there was one piece of simple data analysis in particular that had a big impact. It was a bar
[03:24.810 --> 03:34.810]  chart and it looked like this. What you can see in this chart is that on the absentee ballots,
[03:34.810 --> 03:40.790]  something was very different in Bladen County from all the other counties. Now, anomalies are
[03:40.790 --> 03:47.510]  quite common in elections, and there are all kinds of explanations. They never tell you what
[03:47.510 --> 03:53.930]  happened. They only tell you where you might look. And it's perfectly possible just from looking at
[03:53.930 --> 04:00.150]  the graph that there's a reasonable explanation. So for example, maybe the Republican candidate
[04:00.150 --> 04:07.450]  grew up in Bladen County, lived in Bladen County, and was a contributing citizen in Bladen County
[04:07.450 --> 04:13.470]  for a long time and had made a lot of relationships and had done better there for real,
[04:13.470 --> 04:20.010]  concrete, idiosyncratic reasons. But as people who knew the district in North Carolina looked
[04:20.010 --> 04:25.090]  at it more and more, they couldn't find an explanation like that. They couldn't find an
[04:25.090 --> 04:31.050]  explanation that made sense in terms of the differences that they knew were there between
[04:31.050 --> 04:39.790]  the different counties. And this data analysis was part of what raised the profile of the problem
[04:39.790 --> 04:48.230]  in this contest. And eventually, an investigation was done. The Democratic candidate
[04:48.230 --> 04:58.650]  pushed for an investigation. And when the investigation was done, then they found that,
[04:58.650 --> 05:06.110]  in Bladen County, and they found enough through this investigation,
[05:06.530 --> 05:14.150]  they found enough compromised ballots that they decided, the North Carolina Board of Elections
[05:14.150 --> 05:22.070]  decided not to certify the results and to have a do-over. Do-overs in elections are really rare.
[05:22.070 --> 05:28.130]  They are expensive. They're upsetting. And for a while, the citizens of North Carolina's ninth
[05:28.130 --> 05:33.710]  district did not have a representative in Congress, but it was the right thing to do.
[05:33.890 --> 05:40.490]  And that's not a happy story while it was happening, but sometimes that's what verifying
[05:40.490 --> 05:49.010]  an election looks like. Now, there are two really important details about this story.
[05:49.090 --> 05:55.550]  One detail is that the investigation happened before the Board of Elections made its decision
[05:55.550 --> 06:02.270]  about certification. If there had been some kind of investigation after certification,
[06:02.270 --> 06:08.190]  it probably wouldn't have made a difference. And that's because the way elections are,
[06:08.190 --> 06:14.650]  you really have to, at some point, decide who won and move on. Decide who won. Someone takes
[06:14.650 --> 06:23.890]  office. Someone takes power. So the critical time for elections is before the results are certified.
[06:23.890 --> 06:30.810]  The second thing about this story is that the investigation happened and was taken seriously
[06:30.810 --> 06:37.550]  by the Board of Elections because a candidate insisted on it. The losing candidate, in fact,
[06:37.550 --> 06:45.150]  insisted on it. And that's, as a technologist wanting to help use data to verify elections,
[06:45.150 --> 06:51.070]  that's a point we can't afford to forget. Because it's the candidates who have the power
[06:51.070 --> 07:00.250]  and the standing to sue in court or to press a Board of Elections to do an investigation.
[07:00.250 --> 07:10.610]  So as technologists, if we want to have a high impact, we want to somehow dovetail our interest
[07:10.610 --> 07:16.950]  in free and fair elections with the candidate's self-interest, with the candidate's interest
[07:16.950 --> 07:27.570]  in winning. I've done a lot of both partisan and nonpartisan work in elections. And when you're
[07:27.570 --> 07:35.410]  taking the nonpartisan side, it's really tempting to think of partisans as somehow
[07:38.450 --> 07:48.610]  evil or compromised or not as morally grand as nonpartisans. But in fact, in America,
[07:48.990 --> 07:58.950]  and probably I would guess all over the world, the real hard work of holding elections and
[07:58.950 --> 08:04.990]  election administration accountable is done by the people who have the most skin in the game.
[08:04.990 --> 08:12.300]  It's done by the partisans. It's done by the people who are competing to win the election.
[08:12.690 --> 08:19.470]  So the most powerful thing we can do as technologists in applying technology to
[08:19.470 --> 08:25.450]  this particular issue, how do you make sure that investigations that should happen do happen?
[08:27.950 --> 08:35.110]  We do it most effectively when we have in mind how can we help candidates insist
[08:35.110 --> 08:41.230]  on investigations and in fact help losing candidates insist on investigations when those
[08:41.230 --> 08:49.070]  investigations are justified. What we've built and what will be available live online after the
[08:49.070 --> 08:54.190]  November election, actually it will be available beforehand for folks to play around with,
[08:54.190 --> 09:01.810]  but will be there for the November election, is a system that allows a candidate, or really
[09:01.810 --> 09:13.410]  anybody, to go and look at visualizations of data of election results from the election before
[09:13.410 --> 09:22.230]  certification. And it allows them to look at anomalies, anomalies like the one in Bladen
[09:22.230 --> 09:30.210]  County if they exist, and some people might say that this is going to open up a big can of worms
[09:30.210 --> 09:37.330]  because there are lots of anomalies in elections. Elections in my experience are full of anomalies
[09:37.330 --> 09:42.750]  and in my experience most of them are completely, completely legitimate.
[09:45.050 --> 09:57.350]  So, are we encouraging mass hysteria? Here's how I view it. So, the only people,
[09:57.350 --> 10:02.210]  well the people with the most power to push for investigations really are the candidates.
[10:02.510 --> 10:08.770]  And while candidates tend to be in the candidate bubble, and I can tell you from personal
[10:08.770 --> 10:15.190]  experience, you always think you're going to win, you always think you did win,
[10:15.770 --> 10:23.690]  the staff of the campaign and the people who back the candidate both morally and politically
[10:23.690 --> 10:32.350]  and emotionally and with their dollars tend to have a more reasonable view of things.
[10:32.350 --> 10:39.330]  And candidates don't often insist for long on investigations into problems that
[10:39.330 --> 10:48.950]  aren't really problems. I really trust candidates to do that. And the point is that,
[10:48.950 --> 10:55.590]  and I am also willing to take a few candidates who maybe won't do that, I still think it's
[10:55.590 --> 11:02.330]  worthwhile because if the anomaly is for a real reason, an investigation should turn that up.
[11:03.210 --> 11:11.970]  So, my goal is to think about after the election. My goal is to increase the confidence in the
[11:11.970 --> 11:19.310]  election results after the election. And part of building that confidence is the idea that
[11:19.310 --> 11:26.370]  if there are investigations that should have happened, they happened. Here's how the system
[11:26.370 --> 11:37.870]  works. You choose a state. Let's say you choose North Carolina. And then you choose a contest
[11:37.870 --> 11:44.150]  type. So you could look at presidential, you could look at congressional, you could look at state
[11:44.150 --> 11:54.490]  house contests. And once you've done that, then you pick a particular contest or a group of
[11:54.490 --> 12:01.110]  contests. So you could pick all congressional contests or you could pick any particular
[12:01.110 --> 12:11.050]  congressional contest if you had chosen contest type congressional. Then what you get is some bar
[12:11.050 --> 12:19.450]  charts which the system has chosen to present to you. So the system is presenting anomalies,
[12:19.450 --> 12:29.450]  anomalies of interest. And if you ran this on the North Carolina 2018 data, the system would
[12:29.450 --> 12:37.470]  show you as one of the most anomalous charts and most interesting charts, the Bladen County
[12:37.470 --> 12:44.690]  anomaly. So if you look at the top chart, it's in a slightly different order. Bladen is on the left
[12:44.690 --> 12:51.110]  because it is the anomaly. And if you compare it to the original chart that was published
[12:52.030 --> 12:58.710]  by the political science professor and campaign consultant in North Carolina right after the
[12:58.710 --> 13:06.690]  election, you see that's the chart. It's absentee ballots, accepted absentee by mail ballots,
[13:07.850 --> 13:14.450]  and it's by county, and something anomalous happened in Bladen County.
[13:15.480 --> 13:25.090]  So that's a case where the anomaly really makes a difference in the outcome of the contest.
[13:25.550 --> 13:30.010]  And you can see that actually, that it's an important anomaly. If you look
[13:30.530 --> 13:37.660]  underneath the bar chart, you can see it says votes at stake 110, margin 900.
[13:38.760 --> 13:48.440]  This is a measure of how much impact this anomaly has on the margin. So if indeed it's an anomaly
[13:48.440 --> 13:56.240]  that's due to something that could be undone, then that tells you how valuable it might be
[13:56.240 --> 14:03.480]  to the losing candidate, who in this case was Dan McCready. Another anomaly that the system finds
[14:03.480 --> 14:14.080]  for North Carolina 08, sorry, in 2018, is the anomaly chart on the bottom. Here the anomaly is
[14:14.080 --> 14:23.260]  that there seem to be no provisional votes in Randolph County, whereas there are provisional
[14:23.260 --> 14:30.160]  votes in all the other counties. This is for the sixth congressional district.
[14:32.000 --> 14:37.840]  And it's not that important in terms of the overall votes at stake versus the margin,
[14:37.840 --> 14:45.040]  because you can look and see that the votes at stake, just 25 about, and the margin was about
[14:45.040 --> 14:54.500]  37,000. So what's going on here? Well, it turned out that there were provisional votes in Randolph
[14:54.500 --> 14:59.120]  County in this contest, and the North Carolina Board of Elections had them, and they were part
[14:59.120 --> 15:03.100]  of the official results, and they were part of what you would see if you went to look them up
[15:03.100 --> 15:10.060]  online through their usual interface. They just hadn't made it into the data file that the Board
[15:10.060 --> 15:16.100]  of Elections provided for download. And I want to point out that while the Bladen County anomaly
[15:16.100 --> 15:22.220]  was known before we did our work, and, you know, you have every reason to question whether I
[15:22.220 --> 15:29.160]  reverse-engineered the system so it would pop out, this anomaly no one had noticed before,
[15:29.160 --> 15:35.300]  as far as I know. And when we pointed it out to the Board of Elections, they corrected the data
[15:35.300 --> 15:43.960]  file and posted a new one. So this shows that anomaly detection can help Boards of Elections
[15:43.960 --> 15:50.720]  with their process of continuous improvement. So that's a second application of this work.
[15:51.180 --> 15:59.060]  So that's the bar chart part of the system, where the system chooses anomalies to show you.
[15:59.060 --> 16:04.120]  There's another part of the system which allows you just to play around comparing various counts
[16:04.120 --> 16:13.600]  in scatter plots. So, for example, we're still in 2018. This is Florida. If you compare the
[16:13.600 --> 16:21.260]  United States Senate votes cast, total votes in the contest, to governor of Florida,
[16:21.640 --> 16:28.860]  and you compare by county, you get the following chart. And the interesting thing is that there is,
[16:28.860 --> 16:33.420]  you can see that all the counties line up, but one of them is a little bit off,
[16:33.420 --> 16:39.840]  and that county is Broward County. In that county, there was an undervote for U.S. Senate
[16:40.300 --> 16:45.780]  when people investigated. What they found was that the Broward County ballot
[16:46.380 --> 16:56.240]  was poorly designed. It had a design flaw which made it pretty easy for people to inadvertently
[16:56.240 --> 17:03.980]  miss the U.S. Senate contest. And it resulted in a significant undervote. And you can see that on
[17:03.980 --> 17:10.980]  this chart. I want to end by showing you my absolute favorite scatter plot chart of all time.
[17:11.940 --> 17:18.680]  This is from the Philadelphia 65th Ward by precinct in 2011,
[17:18.680 --> 17:24.560]  my election contest against Marge Tartaglione, the famous Marge Tartaglione.
[17:25.500 --> 17:32.660]  And you can see that there is a real outlier. I mean, that's a serious outlier. That's like
[17:32.820 --> 17:42.680]  a hundred votes worth outlier. So if I were Marge Tartaglione, and if the margin had been small,
[17:42.680 --> 17:51.240]  which it wasn't, I would be wanting to know what had happened in that precinct. And even if I just
[17:51.240 --> 17:57.220]  were a person who cared about elections and free and fair elections, I might want to know what
[17:57.220 --> 18:02.300]  happened in that precinct. And if someone had come to me and said, you know what, we saw this outlier
[18:02.300 --> 18:09.180]  and we're going to investigate it. And I hope you have an explanation. I would have said, yes, I have
[18:09.180 --> 18:16.520]  an explanation. That is the precinct where my daughter stood outside all day and said, vote for
[18:16.520 --> 18:24.500]  my mom, vote for my mom, vote for my mom. And this points to a third application of this system,
[18:24.500 --> 18:30.740]  which is to political science. What really makes a difference in voter behavior?
[18:31.700 --> 18:39.760]  One way to figure that out is to look at outliers and to get explanations for outliers.
[18:40.400 --> 18:46.240]  There's always something interesting behind an outlier. Let me say a little bit about the anomaly
[18:46.240 --> 18:53.680]  detection algorithm. So first of all, when I first built just the first working prototype
[18:53.680 --> 19:03.260]  of the system, I used just the simplest anomaly detection algorithm that fell to hand, which was to
[19:05.880 --> 19:12.760]  think of, let's say you're taking a particular contest by county, some certain set of ballots,
[19:12.760 --> 19:20.080]  and then for each county, if you have three candidates, then you have a vector in three
[19:20.080 --> 19:27.680]  space that is the vote totals for those three candidates. Or you can look at the percentage
[19:27.680 --> 19:34.320]  splits. You can have that be your vector in three space. In any case, if you take the vector for each
[19:34.320 --> 19:41.500]  county, you can think of them as points in three space. And then you can just, there's a very simple
[19:41.500 --> 19:50.560]  outlier detection thing called z-score. So you can just calculate for each point how far it is from
[19:50.560 --> 20:00.400]  all the other points, and then apply the z-score to figure out what's the outlier in distance from
[20:00.400 --> 20:08.440]  all the other points. So this is, you know, if you think about it in depth, this is, the z-score is
[20:08.440 --> 20:13.780]  the wrong thing to use because you don't have a normal distribution and stuff like that. But
[20:14.440 --> 20:18.620]  just throwing it together to see if it worked, I ran it on the North Carolina data, and
[20:18.620 --> 20:25.620]  out came the Bladen County anomaly as one of the most anomalous slices in North Carolina.
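The prototype approach described above can be sketched as follows: treat each county's vote split as a point, measure how far each point is from all the others, and take a z-score of those distances. The details here are assumptions made for the sketch, not the project's actual code: "how far from all the other points" is read as the mean Euclidean distance, the z-score uses the population standard deviation, and the county names and totals are invented.

```python
import math
import statistics
from typing import Dict, List, Sequence


def vote_shares(totals: Sequence[int]) -> List[float]:
    """Convert raw vote totals for one county into percentage splits."""
    total = sum(totals)
    return [t / total for t in totals] if total else [0.0 for _ in totals]


def distance_z_scores(vectors: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Z-score of each point's mean distance to all the other points.

    Needs at least two points. Returns 0.0 for every point when all
    mean distances are equal.
    """
    names = list(vectors)
    mean_dist = {}
    for name in names:
        others = [math.dist(vectors[name], vectors[o]) for o in names if o != name]
        mean_dist[name] = statistics.fmean(others)
    values = list(mean_dist.values())
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return {name: 0.0 for name in names}
    return {name: (mean_dist[name] - mu) / sigma for name in names}


# Invented three-candidate totals by county, for illustration only.
county_totals = {
    "A": (520, 450, 30),
    "B": (1010, 940, 50),
    "C": (260, 230, 10),
    "D": (2050, 1850, 100),
    "E": (150, 330, 20),  # deliberately different split
}
shares = {county: vote_shares(t) for county, t in county_totals.items()}
z = distance_z_scores(shares)
most_anomalous = max(z, key=z.get)
print(most_anomalous, round(z[most_anomalous], 2))
```

As the speaker notes, the z-score assumes a roughly normal distribution that this data does not have, so the sketch works as a ranking heuristic and not as a statistical test.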
[20:27.520 --> 20:33.460]  So that's a pretty good proof of concept. Now we are doing something more sophisticated,
[20:33.460 --> 20:38.580]  and we're certainly interested in doing a variety of different things.
[20:38.580 --> 20:46.780]  So the tension here is between wanting to use percentages, because that levels the playing
[20:46.780 --> 20:54.880]  field between counties. Counties are of widely varying size in pretty much every state.
[20:55.380 --> 21:00.660]  Maybe not geographically, but in terms of population. And the same is true even of
[21:00.660 --> 21:10.460]  precincts often or any other subdivision you might use. So the argument for using percents
[21:10.460 --> 21:15.600]  is that it takes away that variability, which isn't what you want to look at anyway.
[21:16.340 --> 21:23.940]  So on the other hand, what really matters in the end is how many votes are involved.
[21:23.940 --> 21:30.580]  So that feels like an argument for using vote totals and not percentages. And the way we're
[21:30.580 --> 21:39.880]  splitting the difference there is that we are using percentages to identify the outlying point.
[21:40.860 --> 21:46.620]  And then once we have... if we have a case where we have a point that is an outlier,
[21:47.300 --> 21:58.660]  then we use the vote totals to calculate how many votes are at stake, meaning right now,
[21:59.780 --> 22:07.140]  what would the change in the margin be if you assume that that outlier count, if you alter it
[22:07.140 --> 22:16.400]  back to fit in better with the other counts? How much does that change the margin? By what
[22:16.400 --> 22:23.180]  percentage of the margin does it change? So what's the ratio of the votes at stake to the margin?
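One way to make the votes-at-stake idea concrete is sketched below. The talk does not spell out how the outlier is "altered back to fit in better with the other counts," so this sketch makes an assumption: the outlier county keeps its total number of ballots but is given the pooled percentage split of all the other counties. The function name, the data layout, and the numbers are hypothetical.

```python
from typing import Dict, Tuple


def votes_at_stake(
    totals_by_county: Dict[str, Dict[str, int]],
    outlier: str,
) -> Tuple[float, int, float]:
    """Return (votes_at_stake, margin, ratio) for one outlier county.

    Assumes at least two candidates and at least one vote outside the
    outlier county.
    """
    candidates = list(next(iter(totals_by_county.values())))
    overall = {
        c: sum(county[c] for county in totals_by_county.values())
        for c in candidates
    }
    ranked = sorted(candidates, key=lambda c: overall[c], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    margin = overall[winner] - overall[runner_up]

    # Pooled totals of every county except the outlier.
    others = {c: overall[c] - totals_by_county[outlier][c] for c in candidates}
    others_total = sum(others.values())
    outlier_total = sum(totals_by_county[outlier].values())

    # Winner-minus-runner-up gap the outlier county would show if it split
    # its ballots the way the other counties did (the sketch's assumption).
    expected_gap = outlier_total * (others[winner] - others[runner_up]) / others_total
    actual_gap = totals_by_county[outlier][winner] - totals_by_county[outlier][runner_up]

    at_stake = actual_gap - expected_gap
    ratio = at_stake / margin if margin else float("inf")
    return at_stake, margin, ratio


# Invented two-candidate contest, for illustration only.
contest = {
    "A": {"X": 500, "Y": 500},
    "B": {"X": 600, "Y": 600},
    "C": {"X": 300, "Y": 100},  # the outlier
}
at_stake, margin, ratio = votes_at_stake(contest, "C")
print(at_stake, margin, ratio)  # 200.0 200 1.0
```

In this made-up contest the ratio is 1: undoing the anomaly would erase the whole margin, which is the case the speaker calls clearly important.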
[22:23.180 --> 22:32.320]  If that ratio is one, that means that if this outlier really is not a true good vote count,
[22:32.320 --> 22:37.280]  and if it were changed, it might change the outcome of the contest. So that's
[22:37.280 --> 22:47.040]  clearly important. If that ratio was 1%, then that outlier, it might be interesting, but
[22:47.040 --> 22:52.400]  not from the point of view of would it change the outcome of the contest. So from the point
[22:52.400 --> 22:57.400]  of view of the losing candidate, that's not an interesting anomaly at all, if it's 1% of the
[22:57.400 --> 23:05.660]  margin. So we use this percentage of margin as a way of scoring which anomalies are likely to be
[23:05.660 --> 23:14.620]  of most interest. Let me say a word about what data will actually be available. So back in 2005,
[23:14.620 --> 23:22.980]  when I was thinking about how easy it would be to compare ballots cast to the number of
[23:22.980 --> 23:31.840]  voters checked in at the polling place, one of the most naive things was that I had no idea how hard
[23:31.840 --> 23:41.720]  it was to get election data, never mind to get it in time before certification. I had to
[23:41.720 --> 23:48.580]  actually sue the Commonwealth of Pennsylvania to get voter file data. I had to threaten to sue
[23:48.580 --> 23:54.240]  the Board of Elections in Philadelphia to get election results at precinct level in electronic
[23:54.240 --> 24:01.200]  form. And eventually I had to run for the Board of Elections myself, which by the way, I highly
[24:01.200 --> 24:08.380]  recommend, even though it was simultaneously the best and worst experience of my life serving. It
[24:08.380 --> 24:14.860]  was very difficult, but it was important and very satisfying. And the more technologists we have in
[24:14.860 --> 24:23.000]  office, in particular in offices that have some say over the conduct of elections, the better
[24:23.000 --> 24:31.500]  elections are going to be in this country. But what data is available, what format it's in,
[24:31.500 --> 24:39.020]  and how quickly you can get it, all of this varies wildly all over this country. And some
[24:39.020 --> 24:45.260]  states like North Carolina, Virginia do an awesome, awesome job of putting the data out there.
[24:45.260 --> 24:54.040]  And other states don't. And we will collect as fast as we can, as much as we can. And
[24:55.800 --> 25:01.100]  that's a place actually where we also can use help if people want to be part of the collecting,
[25:01.500 --> 25:08.880]  that's awesome. While we're on the subject of collecting the data, you'll see if you go visit
[25:08.880 --> 25:16.780]  the code repository on GitHub, you'll see that a good amount of it, most of it in fact, is dedicated
[25:16.780 --> 25:25.080]  to munging the data, to taking the data from whatever format it arrives in and putting it
[25:25.080 --> 25:33.800]  into a common data format so that the analysis algorithms that we have will be applicable to
[25:33.800 --> 25:40.300]  all of the data that we have. That's something actually that I'm really proud of in the project
[25:40.300 --> 25:46.940]  already. It's never easy to take data from different formats and put it into a common
[25:46.940 --> 25:56.060]  format. This makes the process as straightforward as possible. So we are looking for people to
[25:56.060 --> 26:04.780]  contribute to the collection effort, but we are also looking for people to contribute to
[26:04.780 --> 26:11.320]  building the system. So I should say I should also thank the National Science Foundation,
[26:11.320 --> 26:18.180]  which funded the building of the back end. And it's the back end that you'll see on GitHub and
[26:18.180 --> 26:25.140]  that you should feel free to take and use however you like. Well, if you've stuck with this talk
[26:25.140 --> 26:33.100]  this long, then maybe you're even interested in thinking about contributing to the project. We
[26:33.100 --> 26:41.700]  would love that. We would love to have people build visualizations and analysis on top of
[26:41.700 --> 26:52.120]  the standardized data that we can now provide. We are looking for someone or someones to
[26:52.980 --> 27:01.640]  build tools that will pull data from APIs into our common data format and also
[27:01.640 --> 27:09.120]  take the data that we have ourselves and to put it out in the common data format
[27:09.660 --> 27:15.360]  that has been developed by the National Institute of Standards and Technology.
[27:16.160 --> 27:21.960]  We're also looking for people to help with documentation, just making sure our documentation
[27:21.960 --> 27:29.440]  is clear. We're looking for people who are interested in merging in other data sets.
[27:29.440 --> 27:37.020]  There is a lot of potential here to do analysis not based only on election results,
[27:37.520 --> 27:44.620]  but also on other election related data and even data that maybe doesn't seem at first glance to be
[27:44.620 --> 27:51.840]  election related, but I always think of the weather. It's one of these folk truths in
[27:51.840 --> 28:01.160]  elections that weather affects who comes out for elections. There will be data about COVID,
[28:01.160 --> 28:07.080]  there will be data about different ways that different jurisdictions have handled vote by mail.
[28:07.080 --> 28:14.240]  There's going to be all kinds of interesting data and we really would love to find people
[28:14.620 --> 28:23.160]  who want to merge in some of that data and build analysis on top of that combination of data.
[28:23.220 --> 28:28.940]  And finally, if you have experience building a successful open source community,
[28:28.940 --> 28:36.340]  it would be terrific to get your input and get your help in building this
[28:37.060 --> 28:44.320]  as a long-lived open source community. Because we're focused now on 2020,
[28:44.980 --> 28:47.820]  but there's a lot of potential for growth here.
